Abstract

For a countable-state Markov decision process we introduce an embedding 
which produces a finite-state Markov decision process. The finite-state embedded 
process has the same optimal cost, and moreover, it has the same dynamics as 
the original process when restricting to the approximating set. The embedded 
process can be used as an approximation which, being finite, is more convenient 
for computation and implementation. 
1 Introduction

In this paper we develop a tool that is useful in studying countable-state
Markov Decision Processes (MDPs) [P]. A Markov Decision Process is a
controlled dynamical system with probabilistic transitions that are
influenced by the control actions (for precise definitions see §1.1). We consider
discrete-time MDPs with a discrete state space $X$ which is either finite or
countably infinite, to which we will refer in the sequel as countable. The cost
under consideration is the long-run average cost.

Countable MDPs are obviously more difficult to study, analytically and 
numerically, than finite state MDPs. Several approaches were developed to 
deal with this issue. The first approach is to reduce the state space by clus- 
tering together "equivalent" states: see e.g. [GDG] and references therein. 
This approach provides a smaller state space and exact relations, but re- 
quires a very special structure of the MDP in order for the derived model 
to have a finite state space. Namely, equivalent states must have the same 
transition probability into and out of the state, under any action, and the 
same immediate cost. This special structure seldom exists in applications. 

A second approach approximates countable MDPs by finite state MDPs
using a truncation of the state space. Existing results show that as the
size of the approximating MDP increases, its cost function and, under some
conditions, its optimal policies approach those of the original, countable MDP.
See, e.g., [C], [A1], [A2] and references therein. This approach is applicable in
greater generality, but typically provides approximations without an error
estimate, so the results are "asymptotic" in nature.

We propose a different approach, with the advantage that the optimal 
cost of the approximating, finite MDP agrees with that of the countable 
MDP. Moreover, restricted to the approximating set the optimal policies 
agree as well. Thus the term "exact approximations." The main idea is finite 
embedding. In [F] embedding techniques were used to obtain optimal policies 
within various classes. We apply the embedding approach for approximations 
with general, compact action spaces. Section 5 develops some applications,
where in some cases exact closed-form expressions can be obtained.

We conclude this section with a precise statement of the problem. In §2
we introduce the main idea, the finite embedding, and prove its existence. In
§3 we show that the embedding possesses the desired properties. We discuss
some extensions in §4.



1.1 Problem formulation 

Consider a process with state space $X \subset \{0,1,2,3,\ldots\}$. When the system
occupies state $i \in X$, the controller can influence its behavior by
choosing an action $a$ from the compact action set $A_i$, $i \in X$, which is a subset of
the action space $A$. Choosing an action $a \in A_i$ has a twofold effect:

(i) A running cost $c(i,a)$ is incurred,

(ii) The system transits from state $i$ to $j$ according to the transition
probability $P(j \mid i,a)$.

Thus an MDP is defined in terms of a quadruplet

\[ \mathcal{M} = (X, \{A_i\}, c(i,a), P(j \mid i,a)). \]

The state and action at time $k \ge 0$ are denoted $x_k$ and $a_k$ respectively, so
that the system's behavior on the infinite time interval is described in terms
of the stochastic process $\{(x_k,a_k)\}_{k=0}^\infty$.

Admissible policies. A policy $\pi$ is a law which is used to choose the actions
$a_k \in A_{x_k}$. It is admissible if its choice at time $k$ depends only on the history
$(x_0,a_0,\ldots,x_{k-1},a_{k-1},x_k)$ of the system up to time $k$. A policy can be either
deterministic or randomized, so that the choice of $a_k$ may be made according
to a probability measure on $A_{x_k}$. A (possibly randomized) policy which
depends only on $x_k$ is called a "stationary Markov policy". Such policies
generate a state process $\{x_k\}_{k=0}^\infty$ which is a Markov chain with stationary
transition probabilities.

The cost criterion. An admissible policy $\pi$ generates the stochastic
processes $\{x_k\}_{k=0}^\infty$ and $\{a_k\}_{k=0}^\infty$, and the expected cost flow

\[ C_N(i,\pi) = E_i^\pi \sum_{k=0}^{N-1} c(x_k,a_k), \qquad N \ge 1. \tag{1.1} \]

The expectation $E_i^\pi$ in (1.1) is with respect to the probability measure $P_i^\pi$
induced by $\pi$ on the set of sequences $\{(x_k,a_k)\}_{k=0}^\infty$ with $x_0 = i$. We address
the optimal control problem of minimizing the functional

\[ \pi \mapsto J_i(\pi) = \liminf_{N\to\infty} \frac{1}{N} C_N(i,\pi), \qquad x_0 = i, \]

over all admissible policies, and call $J_i(\pi)$ the expected long-run average cost.
The notation $g^*(\mathcal{M})$ for the optimal cost makes explicit the model $\mathcal{M}$ under
consideration. An optimal policy realizes the minimal long-run average cost,
and an $\varepsilon$-optimal policy realizes it up to $\varepsilon$.
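
To make the criterion concrete, the following minimal sketch evaluates the normalized cost flow $C_N(i,\pi)/N$ of (1.1) exactly for a finite chain under one fixed stationary policy, by propagating the distribution of $x_k$. The function name `average_cost` and the two-state transition matrix and costs are illustrative assumptions, not taken from the paper.

```python
def average_cost(P, c, start, horizon):
    """Return C_N(start) / N for a finite Markov chain under a fixed stationary policy.

    P[i][j] is the one-step transition probability and c[i] is the running
    cost, both already evaluated at the action the policy picks in state i.
    """
    n = len(P)
    dist = [0.0] * n          # distribution of x_k, starting from a point mass
    dist[start] = 1.0
    total = 0.0               # accumulates E sum_{k < N} c(x_k, a_k)
    for _ in range(horizon):
        total += sum(dist[i] * c[i] for i in range(n))
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total / horizon


# Hypothetical two-state example in which only state 1 is costly.
P = [[0.9, 0.1],
     [0.5, 0.5]]
c = [0.0, 1.0]
print(average_cost(P, c, start=0, horizon=2000))
```

For this example the stationary distribution puts mass 1/6 on state 1, so the printed value approaches 1/6 as the horizon grows, whatever the initial state.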

We need to exclude one case in which it is not possible to approximate a 
countable MDP by a finite one. 

Definition 1.1 Let $\sigma$ be a stationary Markov policy of an MDP $\mathcal{M}$,
generating the state-action process $\{(x_k,a_k)\}$. We say that $\sigma$ is a drifting policy if
for every finite set $F \subset X$ and every initial condition $i \in F$,

\[ P_i^\sigma(x_k \in F) \to 0 \quad \text{as } k \to \infty. \tag{1.2} \]

We say that an MDP $\mathcal{M}$ is a drifting MDP if there exists a constant $\delta > 0$
such that any stationary Markov policy $\sigma$ of $\mathcal{M}$ that satisfies

\[ J(\sigma) < g^*(\mathcal{M}) + \delta \tag{1.3} \]

is drifting. An MDP is called non-drifting if it is not a drifting MDP.

Obviously it is not possible to approximate a drifting MDP by a finite MDP
while preserving cost structure, optimality and dynamics. Moreover,
if the MDP is drifting then by Definition 1.1, for a certain $\delta > 0$ and every
$0 < u < \delta$, every stationary Markov policy with average cost smaller than
$g^* + u$ eventually leaves every finite set $F$ with probability 1. But this yields
the existence of a Markov policy (not necessarily stationary) with average
cost $g^*$ (that is, an optimal policy) which eventually leaves every finite set
with probability 1. Assuming that this does not happen is the non-drifting
condition. In this paper we will therefore consider only non-drifting MDPs.
Borkar's coercive condition [B] requires that the immediate cost is higher than
$g^*$ for "far states", as in (5.1), and thus ensures a non-drifting condition.

2 The embedding 

We wish to associate with the given MDP $\mathcal{M}$ a finite state MDP $\mathcal{M}_0$,

\[ \mathcal{M}_0 = (X_0, A_0, Q_0(j \mid i,a), c_0(i,a)), \]

with cost flow $C^0_N(i,\pi)$ defined as in (1.1), in such a manner that the two
MDPs will share common optimality properties and, in some sense, will share
transition probabilities. To this end we introduce the notion of embedding,
whose exact definition is presented below. In §3 we prove that this definition
implies the desired exact approximation property.

We will next define the embedding notion, which will be followed by a
construction of an embedding $\mathcal{M}_0$. Denote by $\{(x_k,a_k)\}_{k=0}^\infty$ the generic
state-action process of $\mathcal{M}$ and by $\{(\xi_k,\alpha_k)\}_{k=0}^\infty$ the generic state-action process of
$\mathcal{M}_0$. Given a finite subset $Z \subset X$ define

\[ \eta = \inf\{k \ge 0 : x_k \notin Z\}, \qquad \tau = \inf\{k > \eta : x_k \in Z\}. \tag{2.1} \]

Thus if the initial state is in $Z$ then $\eta$ is the first exit time and $\tau$ is the first
return time, while if the initial state is not in $Z$ then $\eta = 0$ and again $\tau$ is
the first return time. For a stopping time $\nu \ge 0$ define

\[ C_\nu(i,\pi) = E_i^\pi \sum_{k=0}^{\nu-1} c(x_k,a_k) \]

and note that if $\nu = N$, a deterministic integer, this definition agrees with
(1.1). (If, however, $\nu$ is a random variable then $C_\nu(i,\pi) \ne C_N(i,\pi)|_{N=\nu}$, the
latter being a random variable.) Given a finite subset $Z_0 \subset X_0$, define $\eta_0$, $\tau_0$
and $C^0$ analogously.
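
The two stopping times in (2.1) can be read off a single sample path. The sketch below is one way to do this; the helper `exit_and_return_times`, the `step` callback and the cyclic example chain are hypothetical names introduced only for illustration.

```python
import random


def exit_and_return_times(step, x0, Z, rng, max_steps=10_000):
    """Follow one path and return (eta, tau) as in (2.1).

    eta is the first k >= 0 with x_k outside Z, and tau is the first
    k > eta with x_k in Z. `step(x, rng)` draws the next state. A time that
    is not reached within max_steps is reported as None.
    """
    x, eta, tau = x0, None, None
    for k in range(max_steps + 1):
        if eta is None:
            if x not in Z:
                eta = k            # first time the path is outside Z
        elif x in Z:
            tau = k                # first return to Z strictly after eta
            break
        x = step(x, rng)
    return eta, tau


def cyclic_step(x, rng):
    # Hypothetical deterministic chain 0 -> 1 -> 2 -> 3 -> 0 (rng is unused).
    return (x + 1) % 4


print(exit_and_return_times(cyclic_step, 0, {0, 1}, random.Random(0)))  # (2, 4)
print(exit_and_return_times(cyclic_step, 2, {0, 1}, random.Random(0)))  # (0, 2)
```

The second call starts outside $Z$, so $\eta = 0$ and $\tau$ is the first entrance time, matching the remark that follows (2.1).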

Definition 2.1 We say that $\mathcal{M}_0$ is embedded in $\mathcal{M}$ if there exist subsets
$Z_0 \subset X_0$ and $Z \subset X$, and a one-to-one mapping $e : Z_0 \to Z$ from $Z_0$ onto $Z$
such that, for any stationary Markov policy $\sigma$ of $\mathcal{M}$ under which $Z$ contains
at least one recurrent state, the following holds.

There exists a stationary Markov policy $\sigma_0$ of $\mathcal{M}_0$ such that if the processes
start with initial states $x_0 \in Z$ and $\xi_0 = e^{-1}(x_0) \in Z_0$ respectively, then $x_\tau$
and $\xi_{\tau_0}$ are identically distributed (under the probability measures $P^{\sigma}_{x_0}$ and
$P^{\sigma_0}_{\xi_0}$ respectively, identifying $Z_0$ with $Z$ through $e$) and, if the term on the
right is finite,

\[ C_\tau(x_0,\sigma) = C^0_{\tau_0}(\xi_0,\sigma_0). \]

Embedding means that if we restrict attention only to the states $Z$ in $X$
and to the corresponding states $Z_0$ in $X_0$, then the performance of any
stationary Markov policy $\sigma$ on $\mathcal{M}$ can be imitated by the performance of some
stationary Markov policy $\sigma_0$ on $\mathcal{M}_0$. This imitation can be achieved when
considering finite subsets $F$ of the state space $X$ on which $\sigma$ generates
nontrivial dynamics, namely the states in $F$ are not all transient under $\sigma$.



Remark 2.2 The idea of embedding is a natural generalization of Kac's
Theorem and the "chain on $Z$" idea [MT, Thm. 10.2.3, §10.4.2 and §10.6]. Kac's
Theorem gives the value of the stationary probability of a Markov chain in
terms of the "cycle times." Moreover, the stationary distribution of a chain
can be obtained from that of the chain restricted to a subset $Z$, again in terms
of excursion times outside $Z$. In our case we need to account also for the
time and costs accrued during the excursion outside of $Z$.

We now establish that for certain finite sets $Z \subset X$ there exists a finite
state MDP $\mathcal{M}_0 = (X_0, A_0, P_0, c_0)$ and an embedding $e(\cdot)$ of $\mathcal{M}_0$ in $\mathcal{M}$. This
embedding will be useful and significant for stationary Markov policies $\sigma$ which
induce nontrivial dynamics on $Z$. The embedding is such that if $e(\cdot)$ is
defined on $Z_0$, then $Z = e(Z_0)$ and

\[ \#(X_0) = 2\,\#(Z_0). \]

In §3 we will employ the embedding result to establish existence and
characterize optimal policies for certain countable state MDPs. As we shall
see there, we need the embedding to be such that, in addition to the cost
flow, the expected return times agree, that is, $E\tau = E\tau_0$.

2.1 The existence of embedding 

Theorem 2.3 Let $\mathcal{M}$ be a countable state MDP and let $Z \subset X$ be a finite
set. Then there exists a finite state MDP $\mathcal{M}_0$ with state space $X_0$ and an
embedding $e : Z_0 \to Z$ of $\mathcal{M}_0$ in $\mathcal{M}$ such that

\[ \#(X_0) = 2\,\#(Z_0). \]

The result has non-trivial content provided that $Z \subset X$ contains a recurrent
state of some stationary Markov policy of $\mathcal{M}$.

Proof. We will define an MDP $\mathcal{M}_0$, and will then establish that it has the
properties asserted in the theorem. Denote

\[ Z = \{z_1,\ldots,z_n\}, \]

let $Z_0$ be a finite set

\[ Z_0 = \{s_1,\ldots,s_n\} \tag{2.2} \]

and let $e : Z_0 \to Z$ be defined by

\[ e(s_i) = z_i, \qquad i = 1,2,\ldots,n. \tag{2.3} \]

With each state $s_i$ in $Z_0$ we associate a state $\omega_i$, and we then define

\[ X_0 = \{s_1,\ldots,s_n\} \cup \{\omega_1,\ldots,\omega_n\}. \tag{2.4} \]

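We have to specify $\mathcal{M}_0$ as a quadruplet

\[ (X_0, \{(A_0)_s\}, c_0(s,a), P_0(s' \mid s,a)) \]

and define explicitly $A_0$, $c_0$ and $P_0$. We define $A_0(s_i)$ ($= (A_0)_{s_i}$) by

\[ A_0(s_i) = A(e(s_i)) \quad \text{for every } 1 \le i \le n. \tag{2.5} \]

The definition of the action sets $A_0(\omega_i)$ will be given below.

We next define the transition probabilities $P_0(s \mid s_i,a)$, $s \in X_0$, $a \in A_0(s_i)$.
First, for $s_i \in Z_0$,

\[ P_0(s \mid s_i,a) = \begin{cases} P(e(s_j) \mid e(s_i),a) & \text{if } s = s_j \in Z_0, \\ 1 - \sum_{j=1}^{n} P_0(s_j \mid s_i,a) & \text{if } s = \omega_i, \\ 0 & \text{if } s = \omega_j,\ j \ne i. \end{cases} \tag{2.6} \]

The corresponding cost is defined by

\[ c_0(s_i,a) = c(e(s_i),a) \tag{2.7} \]

for $a \in A_0(s_i) = A(e(s_i))$.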
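
As an illustration of (2.6), the sketch below builds the rows of the embedded kernel that leave the states $s_i$, for one fixed action. The labels `("s", i)` and `("omega", i)`, the function `embedded_rows_on_Z0` and the birth-death example are assumptions made for this sketch only.

```python
def embedded_rows_on_Z0(P_row, Z):
    """Build P_0(. | s_i, a) for a fixed action a, following (2.6).

    P_row(z) returns a dict {next_state: probability} of the original MDP
    under the fixed action. Embedded states are labelled ("s", i) and
    ("omega", i) for i = 0..n-1, with e(s_i) = Z[i].
    """
    n = len(Z)
    rows = {}
    for i, z in enumerate(Z):
        probs = P_row(z)
        # Transitions that stay inside Z are copied from the original model.
        row = {("s", j): probs.get(Z[j], 0.0) for j in range(n)}
        stay = sum(row.values())
        row.update({("omega", j): 0.0 for j in range(n)})
        row[("omega", i)] = 1.0 - stay   # all mass leaving Z is sent to omega_i
        rows[("s", i)] = row
    return rows


def walk_row(x):
    # Hypothetical birth-death walk on {0, 1, 2, ...}: up w.p. 0.3, down w.p. 0.7.
    if x == 0:
        return {0: 0.7, 1: 0.3}
    return {x - 1: 0.7, x + 1: 0.3}


rows = embedded_rows_on_Z0(walk_row, Z=[0, 1, 2])
print(rows[("s", 2)])
```

In this example only the boundary state 2 can leave $Z$, so only the row of $s_2$ gives positive probability to its companion state $\omega_2$.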

We now fix $1 \le i \le n$ and define the action sets and the transition
probabilities for the state $\omega_i \in X_0 \setminus Z_0$. Let $\sigma$ be a stationary Markov policy
of $\mathcal{M}$ such that $Z$ contains at least one recurrent state. Recall the definitions
(2.1) and let

\[ q_j(\sigma) = P^\sigma\{x_\tau = z_j \mid x_{\eta-1} = z_i\}. \tag{2.8} \]

This is the probability that the process $\{x_k\}$ will first enter $Z$ through state
$z_j$, conditioned on having left $Z$ from $z_i$ while employing the action $a \in A(z_i)$
specified by $\sigma$. The action for state $\omega_i$ induced by $\sigma$ is the collection of $n+1$
nonnegative numbers

\[ a(\sigma) = (\lambda q_1(\sigma),\ldots,\lambda q_n(\sigma), c(\sigma)) \tag{2.9} \]

where the constant $\lambda$ satisfies $0 < \lambda \le 1$. The two parameters $\lambda$ and
$c(\sigma)$ which define the action $a(\sigma)$ will be specified below. The quantities
$q_1(\sigma),\ldots,q_n(\sigma)$ are the probabilities associated in (2.8) with the fixed state $z_i$
and the stationary Markov policy $\sigma$. We define the action set of $\omega_i$ to be

\[ A_0(\omega_i) = \bigcup_{\sigma} a(\sigma), \tag{2.10} \]

where the union is over all the stationary Markov policies under which $Z$
contains a recurrent state. Of course, two different policies $\sigma_1$ and $\sigma_2$ may
give rise to the same action, that is $a(\sigma_1) = a(\sigma_2)$.

We now define the transition probabilities from $\omega_i$. If $a \in A_0(\omega_i)$, then

\[ a = (a_1,\ldots,a_n,a_{n+1}) = (\lambda q_1,\ldots,\lambda q_n, c) \tag{2.11} \]

for some constant $0 < \lambda \le 1$, and it follows that $\sum_{j=1}^{n} a_j = \lambda$. We then
define $P_0(s \mid \omega_i,a)$ by

\[ P_0(s \mid \omega_i,a) = \begin{cases} a_j & \text{if } s = s_j \in Z_0, \\ 1 - \lambda & \text{if } s = \omega_i, \\ 0 & \text{if } s \notin Z_0 \cup \{\omega_i\}. \end{cases} \tag{2.12} \]

Thus for every choice of $0 < \lambda \le 1$, the conditional probability to enter
$Z_0$ through $s_j$, given that the process did enter $Z_0$, is $q_j$, independent of $\lambda$.
However, the value of $\lambda$ determines the expected time that would elapse until
entrance, and we choose $\lambda$ in such manner that this expected time turns out
to be equal to the corresponding time for $\mathcal{M}$. Namely, if $\sigma_0$ is the policy that
uses $a(\sigma)$ in state $\omega_i$, then $\lambda$ is chosen such that

\[ 1 + E^{\sigma_0}_{\omega_i}\tau_0 = E^{\sigma}_{z_i}[\tau \mid \eta = 1]. \tag{2.13} \]

Remark 2.4 If $z_i$ is not accessible from the state in $Z$ which is recurrent
under $\sigma$ then we do not need to specify actions for $\omega_i$, while if it is accessible
then necessarily it is recurrent, so that the expectation on the right-hand side
of (2.13) is finite. An appropriate value of $\lambda$ can clearly be chosen since for
$\lambda = 1$ the expected time $E^{\sigma_0}_{\omega_i}\tau_0$ on the left-hand side equals 1, while as
$\lambda \to 0$ this expression diverges.
We note that if it is possible to leave $Z$ from $z_i$ in one step then

\[ E^{\sigma}_{z_i}[\tau \mid \eta = 1] = 1 + \Big[ \sum_{x_j \notin Z} P(x_j \mid z_i,\sigma(z_i)) \Big]^{-1} \sum_{x_j \notin Z} P(x_j \mid z_i,\sigma(z_i))\, E^{\sigma}_{x_j}\tau. \]

If, however, this is not possible, then we can define the action in $A_0(\omega_i)$ that
corresponds to $\sigma$ in an arbitrary manner. Finally note that the normalizing
constant (in square brackets) above is just $P_0(\omega_i \mid s_i, a(\sigma))$.
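
The role of $\lambda$ in (2.12) can be illustrated numerically. The sketch below assumes the reading that the chain waits at $\omega_i$ for a geometric number of steps with mean $1/\lambda$, so that $\lambda$ is fixed once the mean excursion length outside $Z$ is known; the function names `omega_row` and `lam_for_mean_sojourn` are hypothetical.

```python
def omega_row(q, lam):
    """Transition probabilities out of omega_i as in (2.12).

    q[j] is the entrance probability q_j of (2.8), with sum(q) == 1, and
    0 < lam <= 1. Returns the probabilities of jumping to s_1..s_n and the
    probability of remaining at omega_i.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    return [lam * qj for qj in q], 1.0 - lam


def lam_for_mean_sojourn(mean_steps):
    """Return lam such that the geometric waiting time at omega_i has the given mean.

    Assumption of this sketch: the waiting time is matched to the mean
    number of steps the original chain spends outside Z during an excursion.
    """
    if mean_steps < 1.0:
        raise ValueError("an excursion lasts at least one step")
    return 1.0 / mean_steps


lam = lam_for_mean_sojourn(4.0)
into_Z0, stay = omega_row([0.25, 0.75], lam)
print(lam, into_Z0, stay)   # 0.25 [0.0625, 0.1875] 0.75
```

Dividing the entries of `into_Z0` by their sum recovers `q`, which is the statement that the entrance distribution does not depend on $\lambda$.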

We next consider the cost associated with the action $a$, namely $c_0(\omega_i,a)$,
which is chosen to be such that

\[ C_\tau(z_i,\sigma) = C^0_{\tau_0}(e^{-1}(z_i),\sigma_0) \tag{2.14} \]

holds for all $z_i \in Z$. We distinguish between two cases. If under $\sigma$ the
process $\{x_k\}$ starting at $z_i$ does not leave $Z$, then equality holds in (2.14) by
definition. If the process does leave $Z$, then

\[ C_\tau(z_i,\sigma) = E^{\sigma}_{z_i} \sum_{k=0}^{\tau-1} c(x_k,a_k) \]

can be expressed in the form

\[ c(z_i,\sigma(z_i)) + \sum_{x_j \notin Z} P(x_j \mid z_i,\sigma(z_i))\, C_\tau(x_j,\sigma) + \sum_{z_j \in Z} P(z_j \mid z_i,\sigma(z_i))\, C_\tau(z_j,\sigma). \]

We recall that $c_0(\omega_i,a)$ is actually the $(n+1)$th component of $a(\sigma)$ in (2.9),
denoted $c(\sigma)$. It follows that setting

\[ c_0(\omega_i,a) = [P_0(\omega_i \mid e^{-1}(z_i),\sigma_0)]^{-1} \sum_{x_j \notin Z} P(x_j \mid z_i,\sigma(z_i))\, C_\tau(x_j,\sigma)\, [E^{\sigma_0}_{\omega_i}\tau_0]^{-1} \tag{2.15} \]

ensures the desired equality (2.14). Thus to each $\sigma$ there corresponds an
action $a = a(\sigma)$, and a cost

\[ c_0(\omega_i,a) = a_{n+1} = c(\sigma) \]

which, in view of the explicit expression (2.15), indeed depends on the state
and action alone.

The definition of $\mathcal{M}_0$ is thus complete, and it follows from the definition
that $\mathcal{M}_0$ indeed has the properties asserted in the theorem. □

Remark 2.5 Note that the calculation of $C_\tau$ is very similar to (and is as
complicated as) that of the relative value function in the optimality equation.
More precisely, since we are dealing with a single policy, this is related to the
solution of the Poisson equation: see e.g. [MS, Thm. 9.5].




3 Existence of optimal policies 

In order to use the embedding result we need some optimal policy
to have a recurrent state within some finite set. The following is a simple
condition under which there exists a finite subset $Z$ as required in the theorem
presented in the previous section. Suppose that we have an estimate

\[ g^*(\mathcal{M}) < \gamma \tag{3.1} \]

for some $\gamma$, and moreover, for some ordering of the states $\{x_j\}$ the following
holds:

\[ \liminf_{j \to \infty} \{\min\{c(x_j,a) : a \in A(x_j)\}\} > \gamma. \tag{3.2} \]

It clearly follows from (3.1) and (3.2) that if $J(\sigma) < \gamma$, then some finite set
$Z$ contains a recurrent state of $\sigma$. Such a set is, e.g.,

\[ Z = \{x_j : \min\{c(x_j,a) : a \in A(x_j)\} < \gamma\}. \tag{3.3} \]

An estimate as in (3.1) does not require a computation of the optimal policy,
but can be obtained by restricting to a special type of policies.
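
The set in (3.3) is straightforward to enumerate once a bound $\gamma$ is available. The sketch below assumes states labelled $0,1,2,\ldots$, a user-supplied function returning $\min_a c(x,a)$, and a search limit beyond which (3.2) guarantees that no state qualifies; `candidate_set` and the linear holding cost are illustrative assumptions.

```python
def candidate_set(min_cost, gamma, search_limit):
    """Return Z = {x : min_a c(x, a) < gamma} among states 0..search_limit-1, as in (3.3).

    `min_cost(x)` is assumed to return the minimum over actions of c(x, a).
    The caller must pick search_limit large enough that, by (3.2), no state
    beyond it has minimal cost below gamma.
    """
    return [x for x in range(search_limit) if min_cost(x) < gamma]


# Hypothetical cost c(x, a) = 0.5 * x + a with a in [0.1, 1], and bound gamma = 3.
print(candidate_set(lambda x: 0.5 * x + 0.1, gamma=3.0, search_limit=100))
# [0, 1, 2, 3, 4, 5]
```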

Fix some state $z$ and denote $s = e^{-1}(z)$. Define $\nu = \inf\{k > 0 : x_k = z\}$,
and let $\nu_0$ be defined analogously. A sum such as the cycle cost $C_\nu(z,\sigma)$
below is well defined if it is finite when $c$ is replaced by its absolute value $|c|$.

Theorem 3.1 Fix a stationary Markov policy $\sigma$ such that $z \in Z$ is recurrent
under $\sigma$, and let $\mathcal{M}_0$ be an embedding as above. Let $\sigma_0$ be the stationary
Markov policy of $\mathcal{M}_0$ associated with $\sigma$. If

\[ C_\nu(z,\sigma) = E^{\sigma}_{z} \sum_{k=0}^{\nu-1} c(x_k,a_k) \]

is well defined then

\[ \lim_{N\to\infty} \frac{1}{N} C_N(z,\sigma) = \lim_{N\to\infty} \frac{1}{N} C^0_N(e^{-1}(z),\sigma_0). \]

Remark 3.2 Conditions under which the average cost does not depend on
the initial state are standard, and therefore we shall not elaborate on this
point.

The proof applies without change when the condition that $C_\nu(z,\sigma)$ is well
defined is replaced by the condition that the immediate costs $c(x,a)$ are all
nonnegative.




Proof: We denote $s = e^{-1}(z)$, and by construction $s$ is recurrent under $\sigma_0$.
We note that $s_j$ is accessible from $s$ under $\sigma_0$ if and only if $z_j = e(s_j)$ is
accessible from $z$ under $\sigma$. Therefore we may assume that all states in $Z$
(resp. $Z_0$) communicate, and ignore transient states. It is also convenient
to ignore (or remove from $Z$) states from which $z$ can be reached only by
leaving $Z$.

Let the random times $\nu$ and $\nu_0$ be as in the sentence that precedes
Theorem 3.1. Under the recurrence assumption the limit of the average cost flow
exists. It follows from Theorem 17.2.1 of [MT] that under $\sigma$ the process pair
$\{x_k,a_k\}$ possesses an invariant probability measure which we denote by $\pi$.
Moreover, almost surely under $P^{\sigma}_{z}$ we have

\[ \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} c(x_k,a_k) = E_\pi c(x,a), \]

and analogously for $\mathcal{M}_0$:

\[ \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} c_0(\xi_k,\alpha_k) = E_{\pi_0} c_0(\xi,\alpha), \]

where $\pi_0$ denotes the corresponding invariant probability measure under $\sigma_0$.

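In view of our assumptions concerning recurrence and existence of cycle costs,
this implies that

\[ \lim_{N\to\infty} \frac{1}{N} C_N(z,\sigma) = \frac{E^{\sigma}_{z} \sum_{k=0}^{\nu-1} c(x_k,a_k)}{E^{\sigma}_{z}\nu} \]

and similarly for $C^0$. It therefore suffices to establish that

\[ \frac{E^{\sigma}_{z} \sum_{k=0}^{\nu-1} c(x_k,a_k)}{E^{\sigma}_{z}\nu} = \frac{E^{\sigma_0}_{s} \sum_{k=0}^{\nu_0-1} c_0(\xi_k,\alpha_k)}{E^{\sigma_0}_{s}\nu_0}. \tag{3.4} \]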



We note that the numerators in (3.4) are the cycle costs corresponding to $z$
and $s$, assumed to be well defined. We first deal with the numerator in the
left-hand side of (3.4). Let $I_A$ denote the indicator of the set $A$, that is

\[ I_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise.} \end{cases} \]

Recalling the definitions of $\eta$ and $\tau$ we have

\[ E^{\sigma}_{z} \sum_{k=0}^{\nu-1} c(x_k,a_k) = E^{\sigma}_{z} \sum_{k=0}^{\min(\nu,\eta)-1} c(x_k,a_k) + E^{\sigma}_{z} I_{\{\eta<\nu\}} \Big( \sum_{k=\eta}^{\tau-1} c(x_k,a_k) + \sum_{k=\tau}^{\nu-1} c(x_k,a_k) \Big). \]

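By the construction of the embedding,

\[ E^{\sigma}_{z} \sum_{k=0}^{\min(\nu,\eta)-1} c(x_k,a_k) = E^{\sigma_0}_{s} \sum_{k=0}^{\min(\nu_0,\eta_0)-1} c_0(\xi_k,\alpha_k) \]

since, while $x_k$ is in $Z$, both transition probabilities and immediate costs
agree. Also

\[ E^{\sigma}_{z} I_{\{\eta<\nu\}} \sum_{k=\eta}^{\tau-1} c(x_k,a_k) = E^{\sigma_0}_{s} I_{\{\eta_0<\nu_0\}} \sum_{k=\eta_0}^{\tau_0-1} c_0(\xi_k,\alpha_k) \]

by the definition of the costs $c_0(\omega_i,a)$. Finally, using the Markov property,

\[ E^{\sigma}_{z} I_{\{\eta<\nu\}} \sum_{k=\tau}^{\nu-1} c(x_k,a_k) = \sum_{z_j \in Z} P^{\sigma}_{z}(\eta<\nu,\ x_\tau = z_j)\, C_\nu(z_j,\sigma). \tag{3.5} \]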

Now write

\[ P^{\sigma}_{z}(\eta<\nu,\ x_\tau = z_j) = \sum_{z_i \in Z} P^{\sigma}_{z}(\eta<\nu,\ x_\tau = z_j \mid x_{\eta-1} = z_i)\, P^{\sigma}_{z}(x_{\eta-1} = z_i). \]

Recalling that $\nu$ is the return time to state $z$, we express the first probability
on the right-hand side as

\[ P^{\sigma}_{z}(\eta<\nu,\ x_\tau = z_j \mid x_{\eta-1} = z_i) = P^{\sigma}_{z}(x_t \ne z,\ 1 \le t < \eta,\ x_\tau = z_j \mid x_{\eta-1} = z_i). \]

We now observe that the right-hand side describes the conditional probability
of two events: one before the conditioning, one after. Since this is a Markov
process we have conditional independence and so

\[ P^{\sigma}_{z}(\eta<\nu,\ x_\tau = z_j \mid x_{\eta-1} = z_i) = P^{\sigma}_{z}(\eta<\nu \mid x_{\eta-1} = z_i) \cdot P^{\sigma}(x_\tau = z_j \mid x_{\eta-1} = z_i). \]




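It follows from (3.5) and the above computation that

\[ E^{\sigma}_{z} I_{\{\eta<\nu\}} \sum_{k=\tau}^{\nu-1} c(x_k,a_k) = \sum_{z_i,z_j \in Z} P^{\sigma}_{z}(\eta<\nu \mid x_{\eta-1} = z_i) \cdot P^{\sigma}(x_\tau = z_j \mid x_{\eta-1} = z_i) \cdot P^{\sigma}_{z}(x_{\eta-1} = z_i) \cdot C_\nu(z_j,\sigma). \tag{3.6} \]

Similarly to (3.5) we have the following expression for $\sigma_0$:

\[ E^{\sigma_0}_{s} I_{\{\eta_0<\nu_0\}} \sum_{k=\tau_0}^{\nu_0-1} c_0(\xi_k,\alpha_k) = \sum_{s_j \in Z_0} P^{\sigma_0}_{s}(\eta_0<\nu_0,\ \xi_{\tau_0} = s_j)\, C^0_{\nu_0}(s_j,\sigma_0). \tag{3.7} \]

We now repeat the discussion that appears in the text between equations
(3.5) and (3.6) for the embedded process. Since all the probabilities and
conditional probabilities in (3.6) agree with the corresponding quantities of
the embedded process, in view of (3.5) and (3.7) it remains to establish that

\[ C_\nu(z_j,\sigma) = C^0_{\nu_0}(s_j,\sigma_0). \tag{3.8} \]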

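We proceed as before to consider two cases. We compute $C_\nu(z_j,\sigma)$ as follows:

\[ C_\nu(z_j,\sigma) = E^{\sigma}_{z_j} \sum_{k=0}^{\nu-1} c(x_k,a_k) = E^{\sigma}_{z_j} \sum_{k=0}^{\min(\nu,\eta)-1} c(x_k,a_k) + E^{\sigma}_{z_j} \Big\{ I_{\{\eta<\nu\}} \sum_{k=\eta}^{\nu-1} c(x_k,a_k) \Big\}. \tag{3.9} \]

The first term once again agrees with the embedded chain. Conditioning on
the exit point we can write the second term as

\[ E^{\sigma}_{z_j} \Big\{ I_{\{\eta<\nu\}} \sum_{k=\eta}^{\nu-1} c(x_k,a_k) \Big\} = \sum_{z_i \in Z} P^{\sigma}_{z_j}(\eta<\nu,\ x_{\eta-1} = z_i)\, C_\nu(z_i,\sigma), \tag{3.10} \]

and similarly

\[ E^{\sigma_0}_{s_j} \Big\{ I_{\{\eta_0<\nu_0\}} \sum_{k=\eta_0}^{\nu_0-1} c_0(\xi_k,\alpha_k) \Big\} = \sum_{s_i \in Z_0} P^{\sigma_0}_{s_j}(\eta_0<\nu_0,\ \xi_{\eta_0-1} = s_i)\, C^0_{\nu_0}(s_i,\sigma_0). \tag{3.11} \]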



By construction, the exit probabilities appearing in (3.10) and (3.11) agree,
and since $z$ is accessible from $z_j$, $P^{\sigma}_{z_j}(\eta < \nu) < 1$. Iterating equations (3.9),
(3.10) and (3.11) we see that the costs for both models agree, up to a last
term that goes to zero geometrically fast with the number of iterations. It
follows that the numerators of both sides in (3.4) agree, and the proof for
the denominators is similar. Thus (3.4) is established, and the proof of the
theorem is complete. □
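
The ratio formula used in the proof, average cost equals expected cycle cost divided by expected cycle length, can be checked numerically on a small finite chain. The sketch below does this by plain fixed-point iteration; the three-state chain, its costs and the function names are illustrative assumptions rather than part of the paper.

```python
def stationary(P, iters=5000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def cycle_cost_and_length(P, c, z, iters=5000):
    """Expected cost and length of a cycle from z back to z (first-step equations)."""
    n = len(P)
    h = [0.0] * n   # h[x]: expected cost accumulated from x until the next visit to z
    t = [0.0] * n   # t[x]: expected number of steps from x until the next visit to z
    for _ in range(iters):
        h = [c[x] + sum(P[x][y] * h[y] for y in range(n) if y != z) for x in range(n)]
        t = [1.0 + sum(P[x][y] * t[y] for y in range(n) if y != z) for x in range(n)]
    return h[z], t[z]


# Hypothetical three-state birth-death chain under a fixed policy.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
c = [0.0, 1.0, 4.0]

pi = stationary(P)
average = sum(p * cost for p, cost in zip(pi, c))
cycle_cost, cycle_length = cycle_cost_and_length(P, c, z=0)
print(average, cycle_cost / cycle_length)   # both are close to 1.5
```

Here the stationary distribution is (1/4, 1/2, 1/4), so the cycle from state 0 has expected length 4 (Kac's theorem) and expected cost 6.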

Theorem 3.3 Let $\mathcal{M}$ be a Markov Decision Process, and suppose that the
state $z$ is recurrent under a stationary Markov optimal policy $\sigma$, and that the
cycle cost is finite. Then for any embedding such that $z \in Z$, the optimal
cost of $\mathcal{M}_0$ agrees with that of $\mathcal{M}$. Moreover, $\mathcal{M}_0$ has an optimal policy $\sigma_0$
that agrees with $\sigma$ on corresponding states of $Z$ and $Z_0$.

The theorem assumes explicitly that there exists a stationary Markov optimal
policy. This holds for most applications: for conditions see for example [P]
and references therein.



4 Extensions 

First, note that the requirement that the cycle cost is finite holds whenever
the immediate costs are bounded, since we assume recurrence. Moreover, as
noted in Remark 3.2, this requirement is not needed when the immediate
costs are all of the same sign (positive or negative).

The following result is immediate, but nonetheless useful. 

Theorem 4.1 Fix some $i$. Suppose the stationary policies $\sigma$ and $\sigma'$ are
such that the associated actions $a(\sigma)$ and $a(\sigma')$ have costs starting at $\omega_i$ that
satisfy

\[ a_{n+1}(\sigma) = c(\sigma) > c(\sigma') = a_{n+1}(\sigma'), \]

while

\[ a_j(\sigma) = a_j(\sigma') \quad \text{for every } j = 1,\ldots,n. \]

Then the action $a(\sigma)$ may be eliminated from $A_0(\omega_i)$.

This is quite clear from the definition, and in fact this follows from standard
results of action elimination in MDPs [L].
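
The elimination rule of Theorem 4.1 amounts to keeping, for each entrance vector, only the cheapest action. A minimal sketch follows, with actions stored as tuples $(a_1,\ldots,a_n,a_{n+1})$; the function name and the sample actions are hypothetical, and exact equality of the first $n$ components is assumed (with floating-point data one would normally round or compare within a tolerance).

```python
def eliminate_dominated(actions):
    """Keep, for each entrance vector (a_1..a_n), only the cheapest action.

    Each action is a tuple (a_1, ..., a_n, cost). Actions sharing the first
    n components induce the same transitions from omega_i and differ only in
    cost, so the more expensive ones can be dropped.
    """
    best = {}
    for action in actions:
        key, cost = action[:-1], action[-1]
        if key not in best or cost < best[key]:
            best[key] = cost
    return [key + (cost,) for key, cost in best.items()]


actions = [(0.1, 0.2, 5.0), (0.1, 0.2, 3.0), (0.3, 0.1, 4.0)]
print(eliminate_dominated(actions))
# [(0.1, 0.2, 3.0), (0.3, 0.1, 4.0)]
```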

Next note that, even if the excursion costs $C_\tau$ are difficult to calculate,
any approximation of $C_\tau$ and of the mean excursion times leads to a
non-exact, approximate embedding, in the sense that optimal costs are no longer
equal. However, it is easy to see that the approximation is continuous in
the sense that as the approximations of $C_\tau$ and $E\tau$ improve, the costs
(including optimal costs) of the embedded model approach those of the original
MDP.

We now outline the extension to constrained MDPs; a detailed
description of the model may be found in [A2]. In addition to the usual four
components of an MDP we define a collection of immediate cost functions
$\{d^k(x,a),\ k = 1,\ldots,K\}$. Define $J_i^k(\pi)$ in the same way that $J_i(\pi)$ is defined,
but with $d^k$ replacing the immediate cost $c$. The constrained optimization
problem is to minimize the functional $J_i(\pi)$, subject to the constraints

\[ J_i^k(\pi) \le V_k \]

for some prescribed constants $V_k$, $1 \le k \le K$.

Standard approximations of constrained problems are more difficult to
handle and establish than approximations of unconstrained optimization
problems. The reason for this is that when we require the approximate model
to satisfy the hard constraints

\[ J_i^k(\pi) \le V_k, \qquad k = 1,2,\ldots,K, \]

then clearly we may lose continuity, in the sense that even if the original
problem is feasible (that is, there exist policies satisfying the constraints), an
approximation of the required type may not be feasible [A2]. However, using
our exact approximation, this difficulty does not arise.

The embedding results hold for this model, with the following minor
modification. Recall the definitions (2.11) and (2.15) of the action in $\mathcal{M}_0$.
Define $d^k(\sigma)$ as in (2.15) and define $a(\sigma)$ by

\[ a = (a_1,\ldots,a_n,a_{n+1},a_{n+2},\ldots,a_{n+K+1}) = (\lambda q_1,\ldots,\lambda q_n, c, d^1,\ldots,d^K). \tag{4.1} \]

Then the same arguments show that for the embedded chain, all costs agree
with those of the original model, so that we may approximate the countable
chain by a finite chain.




Note, however, that this model is much less robust. Whereas small errors
in the calculation of the cycle cost for the optimization problem may, in
the worst case, lead to sub-optimality, in the constrained case such errors
may lead to infeasibility, that is, violation of the constraints. Thus, if cycle
costs cannot be computed exactly, this approximation shares the infeasibility
problem with other, more traditional approximation methods.

5 Examples 

Example 5.1 Markov Decision Processes serve as a common model in the
control of dams and reservoirs [LB]. It is standard to discretize both space
and time, in order to arrive at a manageable model. So, let us model the
inflow into the water reservoir by a Markov chain $D_t$ and let $L_t$ denote the
water level at the reservoir. At each epoch (usually a month), the decisions
are to use $Z_t$ of the water for electricity generation, and evacuate $Y_t$ through
spillways. The water level is then given by

\[ L_{t+1} = L_t + D_t - (Y_t + Z_t). \]

Denote $x_t = (L_t,D_t)$ and $a_t = (Y_t,Z_t)$. The revenue (negative cost) in the
model arises from the sale of electrical power. The amount of power produced
is a linear function of the amount of water $Z_t$ used for generation, but also
depends non-linearly on the water level $L_t$ (since higher water level entails
higher energy per unit of water). Thus the cost $c = c(l,z)$ depends on the
state and action taken. Since $D_t$ is a Markov chain, we obtain an MDP
with discrete state process $x_t$ and discrete action process $a_t$ (assuming that
the discretization of $D_t$ is compatible with that of $Y_t$ and $Z_t$). Both control
variables are positive and bounded due to practical considerations (limited
capacity of the generators, and of the spill mechanisms), so that the number
of control actions is finite. Denote the maximal allowed value of $Y_t$ (resp. $Z_t$)
by $\bar y$ (resp. $\bar z$).

Since water levels which are too high may pose danger, if water level and
inflow rates are too high, the maximal value of the control variables must
be used, that is $Y_t = \bar y$ and $Z_t = \bar z$. This may be formulated by defining
a finite set $F \subset X$ so that if the state $(l,d)$ is outside this set, the only allowed
action is $(\bar y,\bar z)$. It is therefore natural to use our embedding results so
that only states in $F$ need be considered. We note that since the state space
is two dimensional, the number of states that we ignore (outside $F$) may be
significant.
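
The restriction on actions outside $F$ can be written down directly. In the sketch below, states are pairs `(level, inflow)`, controls are integer pairs `(y, z)` ranging from 0 up to the maxima, and `F` is a Python set of states; the helper names and the integer discretization are assumptions of this sketch, not part of the model description.

```python
def admissible_actions(state, F, y_max, z_max):
    """Return the allowed (spill, generate) pairs in the given state.

    Outside the finite set F only the maximal release (y_max, z_max) is
    allowed; inside F every discretized pair up to the maxima is allowed.
    """
    if state not in F:
        return [(y_max, z_max)]
    return [(y, z) for y in range(y_max + 1) for z in range(z_max + 1)]


def next_level(level, inflow, action):
    """Water balance L_{t+1} = L_t + D_t - (Y_t + Z_t)."""
    y, z = action
    return level + inflow - (y + z)


F = {(0, 0), (1, 0), (1, 1)}
print(admissible_actions((5, 1), F, y_max=2, z_max=3))       # [(2, 3)]
print(len(admissible_actions((1, 1), F, y_max=2, z_max=3)))  # 12
print(next_level(5, 2, (1, 3)))                              # 3
```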

The case where $D_t$ is i.i.d. is particularly simple, since in this case the
state process is one-dimensional, and outside $F$ the process behaves exactly
like a random walk. Therefore, results on random walks may be used to
calculate the entrance distribution and mean time. In fact, in the simple
case that the cost satisfies $c(l,z) = c$, a constant, for water levels outside of
$F$, the cycle cost is just a constant multiple of the average return time. For
an explicit calculation in a simple case, see the next example.

In water reservoir applications, the object for optimization is often a group 
of reservoirs: there could be dozens of reservoirs, all connected. In this case 
the "curse of dimensionality" makes it impossible to solve such models, and 
additional approximations are required. For a step in this direction see the 
multi-dimensional queueing problem below. 

Example 5.2 In the simplest case where $D_t$ are Bernoulli, and where the
release $Y_t + Z_t$ can only take the values 0 or 1 (with some probability which
we can choose), we arrive at a generic model of operations research. Consider
the problem of controlling a single queue. New jobs arrive according to an
i.i.d. sequence of Bernoulli random variables with mean $\lambda$, and join an infinite
queue. The job at the head of the queue is served, and (independently) the
probability of completion of service (representing the speed of service) is the
control variable $a$. Assume that there exist some $I > 0$ and $\mu > \lambda > \delta$ so
that

\[ A(x) = [\delta,\mu] \ \text{for } x < I \quad \text{and} \quad A(x) = \{\mu\} \ \text{for } x \ge I. \]

That is, the service rate is controlled for small queue size, but the maximal rate
must be used if the queue is large. If we assume further that, for $x \ge I$, the
immediate cost is sub-linear, that is, it satisfies $c(x,a) = c(x,\mu) < \bar c \cdot x$ for
some positive $\bar c$, then all our assumptions hold: state 0 is recurrent under any
policy and cycle costs are finite. The embedded model $\mathcal{M}_0$ can be computed
as follows. Set $X_0 = \{0,1,\ldots,I-1\}$. Since the process moves at most one
step per unit time, the only way to exit this set is through state $I$, so we can
set $\omega = I$. To calculate $c_0$, we need

\[ C_\tau(I,\sigma) = E^{\sigma}_{I} \sum_{k=0}^{\tau-1} c(x_k,\mu) \]

which in fact does not depend on $\sigma$, since the only available action is $\mu$. This
expression is identical to that of a standard queue, without control, with arrival
rate $\lambda$ and service rate $\mu$. However, for the standard queue, transition
probabilities for $x \ne 0$ do not depend on the state $x$. So, denote by $E^0$
expectation for the standard queue, and we can calculate

\[ C_\tau(I,\sigma) = E^0_0 \sum_{k=0}^{\tau-1} c(x_k + I,\mu). \]

But this equals $E^0_0\tau \cdot J$, where $J$ is the average cost for a standard queue
with immediate cost $c(x + I,\mu)$. This can be calculated in terms of the stationary
distribution $\pi$ of the standard queue, that is

\[ C_\tau(I,\sigma) = [E^0_0\tau] \cdot \sum_{x=0}^{\infty} \pi_x\, c(x + I,\mu). \]

Define $\rho = \lambda(1-\mu)/[\mu(1-\lambda)]$. Then it is easy to check that for the
standard queue, $\pi_x = (1-\rho)\rho^x$. On the other hand, by Kac's Theorem
[MT, Thm. 10.2.3], $E^0_0\tau = [\pi_0]^{-1} = 1/(1-\rho)$, so that we have an explicit
expression for $C_\tau(I,\sigma)$. Clearly, this can be calculated analytically for various cost
functions $c$, and in particular for linear costs.
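
The closed-form quantities of this example are easy to evaluate. The sketch below computes $\rho$, the geometric stationary probabilities, the mean return time to 0 given by Kac's theorem, and the stationary average of a linear cost by truncating the series. The function names, the numerical rates and the slope are illustrative assumptions; it is also assumed that away from 0 the queue moves up with probability $\lambda(1-\mu)$ and down with probability $\mu(1-\lambda)$, which is what the stated $\rho$ corresponds to.

```python
def rho(lam, mu):
    """Ratio rho = lam (1 - mu) / (mu (1 - lam)) of the standard queue."""
    return lam * (1.0 - mu) / (mu * (1.0 - lam))


def stationary_prob(x, lam, mu):
    """Stationary probability pi_x = (1 - rho) rho^x."""
    r = rho(lam, mu)
    return (1.0 - r) * r ** x


def mean_return_time_to_zero(lam, mu):
    """Kac's theorem: the mean return time to 0 is 1 / pi_0 = 1 / (1 - rho)."""
    return 1.0 / (1.0 - rho(lam, mu))


def average_linear_cost(lam, mu, slope, terms=10_000):
    """Stationary average of the cost slope * x, by truncating the series."""
    return sum(stationary_prob(x, lam, mu) * slope * x for x in range(terms))


lam, mu = 0.2, 0.5                        # hypothetical arrival and service rates
print(rho(lam, mu))                       # approximately 0.25
print(mean_return_time_to_zero(lam, mu))  # approximately 1.3333
print(average_linear_cost(lam, mu, 2.0))  # approximately 0.6667
```

For a linear cost the series sums to slope times $\rho/(1-\rho)$, which gives a quick check on the truncation.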

This example can be extended as follows. Suppose we do not assume that
the action space is restricted for $x \ge I$. Instead assume that

\[ c(x,a) < \min\{c(y,a) : a \in A_y,\ y \ge I\} \tag{5.1} \]

for all $x < I$ and all $a$. It follows that $\mu$ is the optimal action at $x \ge I$, and
we can apply Theorem 4.1 so that the previous conclusions apply.

In general, since this is a skip-free, one-dimensional problem, our results
allow an easy decoupling of the behavior for $x < I$ from that for $x \ge I$. The
situation is more complicated if the skip-free assumption is violated, namely
if batch arrivals, batch service or both are allowed. However, as is
clear from the proof of Theorem 3.3, we can write an implicit expression for
the cost using the cost flows until the first hitting time of $\{0,1,\ldots,I-1\}$,
to obtain an explicit expression for the embedded model.

Example 5.3 Consider now a multi-dimensional queueing problem. Jobs
of type $1,\ldots,K$ arrive according to a $K$-dimensional process $B(t)$ of i.i.d.
vectors. The $k$th coordinate represents arrivals of customers of type $k$.
Customers join infinite queues, one queue for each type. A single server chooses
at each point in time which queue to serve, and serves the job at the head
of the line. If a job of type $k$ is served, then the service will succeed with
probability $\mu_k$, and then the job will leave the queue.

If we impose the condition that some queue must be served as long as 
not all queues are empty, then the empty state will be recurrent under mild 
conditions. For example, it is sufficient to assume that the total number of 
arrivals at any unit time interval is bounded, and that the condition 

Let $Q(t)$ be the vector of queue sizes at time $t$. This is the state of our MDP,
and the control is the choice of queue to serve. This is easily seen to be an
MDP, once immediate costs $c(q,a)$ are chosen. It is then natural to choose

\[ Z = \Big\{ Q : \sum_{k=1}^{K} Q_k \le q_0 \Big\} \]

and approximate the infinite model with a finite one.
Suppose for some $q_0$ we have that if

\[ \sum_{k=1}^{K} Q_k > q_0 \]

then

\[ c(x,a) = \sum_{k=1}^{K} c_k Q_k \]

for some positive coefficients $\{c_k\}$. Then the system simplifies considerably,
and the computation of the hitting distributions and costs, required for our
approximation, becomes feasible [BMM], [W].